quollr: An R Package for Visualizing 2D Models in High-Dimensional Space

An abstract of less than 150 words.

Jayani P.G. Lakshika https://jayanilakshika.netlify.app/ (Monash University), Dianne Cook http://www.dicook.org/ (Monash University), Paul Harrison (Monash University), Michael Lydeamore (Monash University), Thiyanga S. Talagala https://thiyanga.netlify.app/ (University of Sri Jayewardenepura)
2024-02-20
library(quollr)
library(knitr)
library(kableExtra)
library(readr)
library(ggplot2)
library(dplyr)
library(ggbeeswarm)
library(Rtsne)
library(umap)
library(phateR)
library(reticulate)
library(rsample)

set.seed(20230531)

use_python("~/miniforge3/envs/pcamp_env/bin/python")
use_condaenv("pcamp_env")

reticulate::source_python(paste0(here::here(), "/scripts/function_scripts/Fit_PacMAP_code.py"))
reticulate::source_python(paste0(here::here(), "/scripts/function_scripts/Fit_TriMAP_code.py"))

1 Introduction

2 Methodology

2.1 Usage

library(tools)
package_dependencies("quollr")

Datasets

The quollr package ships with several datasets that are loaded with the package. These are described in Table 1.

Table 1: quollr datasets
Dataset                  Description
s_curve_noise            Simulated 3D S-curve data with four additional noise dimensions.
s_curve_noise_training   Training data derived from the S-curve data.
s_curve_noise_test       Test data derived from the S-curve data.
s_curve_noise_umap       UMAP 2D embedding of the S-curve data (n_neighbors: 15, min_dist: 0.1).

Compute hexagonal bin configurations

Hexagonal binning is a powerful technique for visualizing the density of data points in 2D space. Unlike traditional rectangular bins, hexagonal bins offer several advantages, including a more uniform representation of density and reduced visual bias. However, to bin effectively, it is essential to compute the appropriate configurations based on the characteristics of the dataset.

Before computing hexagonal bin configurations, we need to determine the range of data along the x and y axes to establish the boundary for hexagonal binning. Additionally, choosing an appropriate hexagon size (the radius of the outer circle) is essential. By default, the calculate_effective_x_bins() and calculate_effective_y_bins() functions use a hexagon size of 1.07457. However, users can adjust the hexagon size to fit the data range and achieve regular hexagons without overlapping.

The hexagon size directly affects the number of bins generated. A larger hexagon size results in fewer bins, while a smaller one leads to more bins. Therefore, there is always a trade-off depending on the dataset used, and users should consider their specific data characteristics when selecting the hexagon size.
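This trade-off can be sketched in base R. The sketch below is illustrative only: the exact rule inside calculate_effective_x_bins() may differ; here we assume bins along an axis are spaced by the hexagon's horizontal width, sqrt(3) times the outer-circle radius.

```r
## Illustrative only: count bins along one axis for a given hexagon size,
## assuming bins are spaced by the hexagon's horizontal width
## (sqrt(3) * hex_size for a regular hexagon with outer-circle radius hex_size).
bins_along_axis <- function(x, hex_size) {
  data_range <- max(x) - min(x)
  ceiling(data_range / (hex_size * sqrt(3)))
}

x <- c(-3.2, 0.5, 2.1, 4.8)             # toy embedding coordinates
bins_along_axis(x, hex_size = 1.07457)  # default size -> 5 bins
bins_along_axis(x, hex_size = 0.5)      # smaller hexagons -> 10 bins
```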

num_bins_x <- calculate_effective_x_bins(.data = s_curve_noise_umap, 
                                         x = "UMAP1", hex_size = NA)
num_bins_x
[1] 4
num_bins_y <- calculate_effective_y_bins(.data = s_curve_noise_umap, 
                                         y = "UMAP2", hex_size = NA)
num_bins_y
[1] 8

Generate full hex grid

Generating the full hexagonal grid involves the following steps:

  1. Generate all the hexagonal bin centroids

Steps:

cell_area <- 1

hex_size <- sqrt(2 * cell_area / sqrt(3))

buffer_size <- hex_size/2

x_bounds <- seq(min(s_curve_noise_umap[["UMAP1"]]) - buffer_size,
                  max(s_curve_noise_umap[["UMAP1"]]) + buffer_size, length.out = num_bins_x)

y_bounds <- seq(min(s_curve_noise_umap[["UMAP2"]]) - buffer_size,
                max(s_curve_noise_umap[["UMAP2"]]) + buffer_size, length.out = num_bins_y)

box_points <- expand.grid(x = x_bounds, y = y_bounds)

ggplot() +
  geom_point(data = box_points, aes(x = x, y = y), color = "red")

 box_points <- box_points |>
    dplyr::arrange(x) |>
    dplyr::group_by(x) |>
    dplyr::group_modify(~ generate_even_y(.x)) |>
    tibble::as_tibble()

ggplot() +
  geom_point(data = box_points,
             aes(x = x, y = y, colour = as.factor(is_even)))

## Shift for even values in x-axis
x_shift <- unique(box_points$x)[2] - unique(box_points$x)[1]


box_points$x <- box_points$x + x_shift/2 * ifelse(box_points$is_even == 1, 1, 0)

ggplot() +
  geom_point(data = box_points, aes(x = x, y = y), color = "red")

all_centroids_df <- generate_full_grid_centroids(nldr_df = s_curve_noise_umap, 
                                                 x = "UMAP1", y = "UMAP2", 
                                                 num_bins_x = num_bins_x, 
                                                 num_bins_y = num_bins_y, 
                                                 buffer_size = NA, hex_size = NA)

glimpse(all_centroids_df)
Rows: 32
Columns: 2
$ x <dbl> -3.8076427, -2.6742223, -3.8076427, -2.6742223, -3.8076427…
$ y <dbl> -6.2798254, -4.4744481, -2.6690708, -0.8636935, 0.9416838,…
  2. Generate hexagonal coordinates

Steps:

  - Compute the horizontal width of the hexagon

hex_grid <- gen_hex_coordinates(all_centroids_df, hex_size = NA)
glimpse(hex_grid)
Rows: 192
Columns: 3
$ x  <dbl> -2.674222, -2.674222, -3.807643, -4.941063, -4.941063, -3…
$ y  <dbl> -5.6804828, -6.8791681, -7.4785108, -6.8791681, -5.680482…
$ id <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, …
ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_point(data = all_centroids_df, aes(x = x, y = y), color = "red")
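What gen_hex_coordinates() produces for each centroid can be sketched for a single hexagon (the package internals may order or offset vertices differently): a regular hexagon with outer-circle radius hex_size has six vertices at 60-degree intervals around its centroid.

```r
## Sketch of the six vertices of one hexagonal bin (illustrative only).
hex_vertices <- function(cx, cy, hex_size) {
  angles <- pi / 180 * (30 + 60 * 0:5)  # pointy-top orientation
  data.frame(x = cx + hex_size * cos(angles),
             y = cy + hex_size * sin(angles))
}

v <- hex_vertices(0, 0, hex_size = 1)
sqrt(v$x^2 + v$y^2)  # every vertex is exactly hex_size from the centroid
```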

  3. Map hexagonal IDs

Steps:

full_grid_with_hexbin_id <- map_hexbin_id(all_centroids_df)

ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_text(data = full_grid_with_hexbin_id, aes(x = c_x, y = c_y, label = hexID))

  4. Map polygon IDs

Steps:

full_grid_with_polygon_id <- map_polygon_id(full_grid_with_hexbin_id, hex_grid)
  5. Assign data into hexagons
s_curve_noise_umap_with_id <- assign_data(s_curve_noise_umap, full_grid_with_hexbin_id)
  6. Compute standardized counts
df_with_std_counts <- compute_std_counts(nldr_df = s_curve_noise_umap_with_id)
  7. Extract full grid info
hex_full_count_df <- generate_full_grid_info(full_grid_with_polygon_id, df_with_std_counts, hex_grid)
ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_point(data = s_curve_noise_umap, aes(x = UMAP1, y = UMAP2), color = "blue")

ggplot(data = hex_full_count_df, aes(x = x, y = y)) +
  geom_polygon(color = "black", aes(group = polygon_id, fill = std_counts)) +
  geom_text(aes(x = c_x, y = c_y, label = hexID)) +
  scale_fill_viridis_c(direction = -1, na.value = "#ffffff")

Buffer size

When generating hexagonal bins in R, a buffer is often included to ensure that the data points are evenly distributed within the bins and to prevent edge effects. The buffer helps in two main ways:

  1. Preventing Edge Effects: Without a buffer, the outermost data points might fall near the boundary of the hexagonal grid, leading to incomplete bins or uneven distribution of data. By adding a buffer, you create a margin around the outer edges of the grid, ensuring that all data points are fully enclosed within the bins.

  2. Ensuring Even Distribution: The buffer allows for a smoother transition between adjacent bins. This helps in cases where data points are not perfectly aligned with the grid lines, ensuring that each data point is assigned to the nearest bin without bias towards any specific direction.

Overall, including a buffer when generating hexagonal bins helps to produce more accurate and robust binning results, particularly when dealing with real-world data that may have irregular distributions or boundary effects.
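The effect of the buffer can be seen in a small sketch (variable names are illustrative; the hex_size / 2 buffer mirrors the default used earlier):

```r
## Widening the binning bounds by a buffer keeps extreme points off the boundary.
umap1 <- c(-4.3, -1.2, 0.7, 3.9)     # toy embedding coordinates
hex_size <- sqrt(2 * 1 / sqrt(3))    # hexagon size for unit cell area
buffer_size <- hex_size / 2

x_bounds <- seq(min(umap1) - buffer_size,
                max(umap1) + buffer_size, length.out = 4)
## the extreme data points now sit strictly inside the outer bounds
min(umap1) > x_bounds[1] && max(umap1) < x_bounds[4]
```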

Construct the 2D model with different options

df_bin_centroids <- hex_full_count_df[complete.cases(hex_full_count_df[["std_counts"]]), ] |>
  dplyr::select("c_x", "c_y", "hexID", "std_counts") |>
  dplyr::distinct() |>
  dplyr::rename(c("x" = "c_x", "y" = "c_y"))
  
df_bin_centroids
            x          y hexID std_counts
1  -2.6742223 -4.4744481     5     1.0000
2  -1.5408019 -6.2798254     2     0.3125
3  -0.4073814 -4.4744481     6     0.0625
4  -1.5408019 -2.6690708    10     0.2500
5  -0.4073814 -0.8636935    14     0.5000
6   0.7260390 -2.6690708    11     0.1250
7   1.8594594 -0.8636935    15     0.1875
8   0.7260390  0.9416838    19     0.6250
9   1.8594594  2.7470611    23     0.2500
10  0.7260390  4.5524384    27     0.5625
11  1.8594594  6.3578158    31     0.3750
12  2.9928798  4.5524384    28     0.4375

Construct the high-D model with different options

## To generate a data set with high-D and 2D training data
df_all <- training_data |> dplyr::select(-ID) |>
  dplyr::bind_cols(s_curve_noise_umap_with_id)

## To generate averaged high-D data

df_bin <- avg_highD_data(.data = df_all, column_start_text = "x") ## Need to pass ID column name
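The averaging behind avg_highD_data() can be sketched with base R's aggregate() (a simplified stand-in; the package's handling of IDs and column selection may differ): within each hexagonal bin, the high-D coordinates are averaged to a single representative point.

```r
## Toy data: three observations in two high-D variables, two bins.
df <- data.frame(hb_id = c(1, 1, 2),
                 x1 = c(0.0, 1.0, 4.0),
                 x2 = c(2.0, 4.0, 6.0))

## One averaged high-D point per hexagonal bin.
bin_means <- aggregate(cbind(x1, x2) ~ hb_id, data = df, FUN = mean)
bin_means
```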

Generate the triangular mesh

tr1_object <- triangulate_bin_centroids(df_bin_centroids, x = "x", y = "y")
tr_from_to_df <- generate_edge_info(triangular_object = tr1_object)
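The bookkeeping behind generate_edge_info() and the later 2D distance computation can be sketched in base R (illustrative; the package internals may differ). Each triangle contributes three edges; sorting and deduplicating vertex pairs yields the unique from-to edges, and their lengths follow from the centroid coordinates.

```r
## Toy triangulation: four centroids, two adjacent triangles.
centroids <- data.frame(x = c(0, 2, 1, 3), y = c(0, 0, 2, 2))
triangles <- rbind(c(1, 2, 3), c(2, 3, 4))  # vertex indices per triangle

## Each triangle contributes three vertex pairs; deduplicate shared edges.
edge_list <- lapply(seq_len(nrow(triangles)),
                    function(i) t(combn(sort(triangles[i, ]), 2)))
edges <- unique(do.call(rbind, edge_list))

from <- centroids[edges[, 1], ]
to   <- centroids[edges[, 2], ]
edge_df <- data.frame(from = edges[, 1], to = edges[, 2],
                      distance = sqrt((from$x - to$x)^2 + (from$y - to$y)^2))
edge_df  # the shared edge 2-3 appears only once
```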

Compute parameter defaults

Shift the hexagonal grid origin

If the shift along x is in the positive direction, shift_x should be given as a positive value; otherwise, as a negative value. The same applies to shift_y.

  1. Assign the shift along the x and y axes (the amount should be less than the cell diameter)

  2. Generate bounds with shift origin
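Conceptually, shifting the grid origin just adds the shift to every centroid before the bins are formed. A minimal sketch (names are illustrative; the cell-diameter check mirrors the constraint above):

```r
## Shift every centroid, enforcing that the shift is smaller than the
## cell diameter so the shifted grid still overlaps the data.
shift_centroids <- function(centroids, shift_x, shift_y, cell_diameter) {
  stopifnot(abs(shift_x) < cell_diameter, abs(shift_y) < cell_diameter)
  transform(centroids, x = x + shift_x, y = y + shift_y)
}

centroids <- data.frame(x = c(0, 1), y = c(0, 2))
shift_centroids(centroids, shift_x = 0.269, shift_y = 0.271,
                cell_diameter = 2 * sqrt(2 / sqrt(3)))
```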

all_centroids_df_shift <- extract_coord_of_shifted_hex_grid(nldr_df = s_curve_noise_umap, 
                                                 x = "UMAP1", y = "UMAP2", 
                                                 num_bins_x = num_bins_x, 
                                                 num_bins_y = num_bins_y,
                                                 shift_x = 0.2690002, shift_y = 0.271183,
                                                 buffer_size = NA, hex_size = NA)

glimpse(all_centroids_df_shift)
Rows: 32
Columns: 2
$ x <dbl> -3.8076427, -2.6742223, -3.8076427, -2.6742223, -3.8076427…
$ y <dbl> -6.2798254, -4.4744481, -2.6690708, -0.8636935, 0.9416838,…
hex_grid <- gen_hex_coordinates(all_centroids_df_shift)
glimpse(hex_grid)
Rows: 192
Columns: 3
$ x  <dbl> -2.674222, -2.674222, -3.807643, -4.941063, -4.941063, -3…
$ y  <dbl> -5.6804828, -6.8791681, -7.4785108, -6.8791681, -5.680482…
$ id <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, …
ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_point(data = all_centroids_df_shift, aes(x = x, y = y), color = "red")

full_grid_with_hexbin_id <- map_hexbin_id(all_centroids_df_shift)

ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_text(data = full_grid_with_hexbin_id, aes(x = c_x, y = c_y, label = hexID))

full_grid_with_polygon_id <- map_polygon_id(full_grid_with_hexbin_id, hex_grid)
s_curve_noise_umap_with_id <- assign_data(s_curve_noise_umap, full_grid_with_hexbin_id)
df_with_std_counts <- compute_std_counts(nldr_df = s_curve_noise_umap_with_id)
hex_full_count_df <- generate_full_grid_info(full_grid_with_polygon_id, df_with_std_counts, hex_grid)
ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_point(data = s_curve_noise_umap, aes(x = UMAP1, y = UMAP2), color = "blue")

ggplot(data = hex_full_count_df, aes(x = x, y = y)) +
  geom_polygon(color = "black", aes(group = polygon_id, fill = std_counts)) +
  geom_text(aes(x = c_x, y = c_y, label = hexID)) +
  scale_fill_viridis_c(direction = -1, na.value = "#ffffff")

df_bin_centroids <- hex_full_count_df[complete.cases(hex_full_count_df[["std_counts"]]), ] |>
  dplyr::select("c_x", "c_y", "hexID", "std_counts") |>
  dplyr::distinct() |>
  dplyr::rename(c("x" = "c_x", "y" = "c_y"))

df_bin_centroids
            x          y hexID std_counts
1  -2.6742223 -4.4744481     5     1.0000
2  -1.5408019 -6.2798254     2     0.3125
3  -0.4073814 -4.4744481     6     0.0625
4  -1.5408019 -2.6690708    10     0.2500
5  -0.4073814 -0.8636935    14     0.5000
6   0.7260390 -2.6690708    11     0.1250
7   1.8594594 -0.8636935    15     0.1875
8   0.7260390  0.9416838    19     0.6250
9   1.8594594  2.7470611    23     0.2500
10  0.7260390  4.5524384    27     0.5625
11  1.8594594  6.3578158    31     0.3750
12  2.9928798  4.5524384    28     0.4375
tr1_object <- triangulate_bin_centroids(df_bin_centroids, x = "x", y = "y")
tr_from_to_df <- generate_edge_info(triangular_object = tr1_object)
bin_centroids_shift <- ggplot(data = hex_full_count_df, aes(x = c_x, y = c_y)) +
  geom_point(color = "#bdbdbd") +
  geom_point(data = shifted_hex_coord_df, aes(x = c_x, y = c_y), color = "#feb24c") +
  coord_cartesian(xlim = c(-5, 8), ylim = c(-10, 10)) +
  theme_void() +
  theme(legend.position="none", legend.direction="horizontal", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=8), #change legend title font size
        legend.text = element_text(size=6)) +
  guides(fill = guide_colourbar(title = "Standardized count")) +
  annotate(geom = 'text', label = "a", x = -Inf, y = Inf, hjust = -0.3, vjust = 1, size = 3) 

hex_grid_shift <- ggplot(data = shifted_hex_coord_df, aes(x = x, y = y)) +
  geom_polygon(fill = NA, color = "#feb24c", aes(group = polygon_id)) +
  geom_polygon(data = hex_full_count_df, aes(x = x, y = y, group = polygon_id),
               fill = NA, color = "#bdbdbd") +
  coord_cartesian(xlim = c(-5, 8), ylim = c(-10, 10)) +
  theme_void() +
  theme(legend.position="none", legend.direction="horizontal", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=8), #change legend title font size
        legend.text = element_text(size=6)) +
  guides(fill = guide_colourbar(title = "Standardized count")) +
  annotate(geom = 'text', label = "b", x = -Inf, y = Inf, hjust = -0.3, vjust = 1, size = 3) 

## Before shift
before_shift_plot <- ggplot(data = hex_full_count_df, aes(x = x, y = y)) +
  geom_polygon(color = "black", aes(group = polygon_id, fill = std_counts)) +
  geom_text(aes(x = c_x, y = c_y, label = hexID), size = 2) +
  scale_fill_viridis_c(direction = -1, na.value = "#ffffff", option = "C") +
  coord_equal() +
  theme_void() +
  theme(legend.position="bottom", legend.direction="horizontal", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=8), #change legend title font size
        legend.text = element_text(size=6)) +
  guides(fill = guide_colourbar(title = "Standardized count")) +
  annotate(geom = 'text', label = "a", x = -Inf, y = Inf, hjust = -0.3, vjust = 1, size = 3) 


## After shift
after_shift_plot <- ggplot(data = shifted_hex_coord_df, aes(x = x, y = y)) +
  geom_polygon(color = "black", aes(group = polygon_id, fill = std_counts)) +
  geom_text(aes(x = c_x, y = c_y, label = hexID), size = 2) +
  scale_fill_viridis_c(direction = -1, na.value = "#ffffff", option = "C") +
  coord_equal() +
  theme_void() +
  theme(legend.position="none", legend.direction="horizontal", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=8), #change legend title font size
        legend.text = element_text(size=6)) +
  guides(fill = guide_colourbar(title = "Standardized count")) +
  annotate(geom = 'text', label = "b", x = -Inf, y = Inf, hjust = -0.3, vjust = 1, size = 3) 
Benchmark value to remove the low-density hexagons
## The first quartile (plus a small offset) is used as the default
benchmark_to_rm_lwd_hex <- quantile(df_bin_centroids$std_counts)[2] + 0.01

## To identify low density hexagons
df_bin_centroids_low <- df_bin_centroids |>
  dplyr::filter(std_counts <= benchmark_to_rm_lwd_hex)

## To identify low-density hexagons to be removed by investigating neighbouring mean density
identify_rm_bins <- find_low_density_hexagons(df_bin_centroids_all = df_bin_centroids, num_bins_x = num_bins_x,
                     df_bin_centroids_low = df_bin_centroids_low)
Benchmark value to remove the long edges
## Compute 2D distances
distance <- cal_2d_dist(tr_from_to_df_coord = tr_from_to_df)

## To plot the distribution of distance
plot_dist <- function(distance_df){
  distance_df$group <- "1"
  dist_plot <- ggplot(distance_df, aes(x = group, y = distance)) +
    geom_quasirandom()+
    ylim(0, max(unlist(distance_df$distance))+ 0.5) + coord_flip()
  return(dist_plot)
}

plot_dist(distance)
benchmark <- find_benchmark_value(distance_edges = distance, distance_col = "distance")
benchmark
[1] 2.603
benchmark <- 5
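One common rule for flagging long edges, shown here as a hedged sketch, is the upper boxplot fence of the edge-length distribution. Whether find_benchmark_value() uses this exact rule is an assumption; in practice, inspect the distance distribution and override the benchmark when needed, as above.

```r
## Upper boxplot fence (Q3 + 1.5 * IQR) as a long-edge threshold.
## Illustrative only; not necessarily the rule used by find_benchmark_value().
edge_lengths <- c(1.0, 1.1, 1.2, 1.3, 1.4, 6.0)  # one obvious long edge
fence <- unname(quantile(edge_lengths, 0.75) + 1.5 * IQR(edge_lengths))
fence                                 # 1.75
edge_lengths[edge_lengths > fence]    # 6: the edge that would be removed
```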

Model function

The fit_high_d_model() function is used to generate both the 2D model and the high-D model.

fit_high_d_model(training_data = training_data, nldr_df_with_id = s_curve_noise_umap, x = "UMAP1",
                             y = "UMAP2", num_bins_x = NA, num_bins_y = NA,
                             hex_size = NA, buffer_size = NA,
                             is_bin_centroid = TRUE,
                             is_rm_lwd_hex = FALSE,
                             benchmark_to_rm_lwd_hex = NA,
                             is_avg_high_d = TRUE, column_start_text = "x")
$df_bin
# A tibble: 6 × 8
  hb_id     x1    x2     x3        x4        x5        x6        x7
  <int>  <dbl> <dbl>  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1 -0.470 0.473 -1.75   0.00121   0.00244  -0.0314   -0.000525
2     2  0.170 1.23  -1.76   0.00610  -0.00313   0.00286  -0.00123 
3     6  0.760 1.19  -0.407 -0.000154  0.00342   0.0219    0.000410
4    11 -0.452 1.17   0.294  0.00414  -0.000295  0.00334   0.00207 
5    14 -0.613 1.69   1.79   0.0119   -0.00418  -0.000519 -0.00251 
6    15  0.366 0.999  1.70  -0.00115   0.00375  -0.00747  -0.000931

$df_bin_centroids
           x         y hexID std_counts
1 -3.8076427 -3.807643     1 0.41666667
2 -1.5408019 -3.807643     2 0.62500000
3 -0.4073814 -1.540802     6 0.41666667
4 -0.4073814  2.992880    14 0.08333333
5  0.7260390  0.726039    11 0.58333333
6  1.8594594  2.992880    15 1.00000000
## To generate a data set with high-D and 2D training data
df_all <- training_data |> dplyr::select(-ID) |>
  dplyr::bind_cols(s_curve_noise_umap_with_id)

## To generate averaged high-D data

df_bin <- avg_highD_data(.data = df_all, column_start_text = "x") ## Need to pass ID column name

Predict 2D embeddings

To predict the 2D embeddings for a new data point using the trained high-D model, we follow a series of steps outlined below:

  1. Compute high-dimensional Euclidean distance: Calculate the Euclidean distance between each new data point and the lift model’s coordinates in the high-dimensional space. This distance metric helps identify the nearest lift model coordinates to each new data point.

  2. Find the nearest lift model coordinates: Determine the lift model’s high-dimensional coordinates that are closest to each new data point based on the computed distances. This step helps establish a correspondence between the new data points and the lift model’s representation in the high-dimensional space.

  3. Map the hexagonal bin ID: Assign each new data point to a hexagonal bin based on its nearest lift model coordinates. This mapping ensures that each data point is associated with a specific region in the hexagonal grid representation of the high-dimensional space.

  4. Map the hexagonal bin centroid coordinates in 2D: Transform the hexagonal bin centroid coordinates from the high-dimensional space to the 2D embedding space using the trained lift model. This mapping provides the 2D embeddings for the new data points, allowing them to be visualized and analyzed in a lower-dimensional space.
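The four steps above can be sketched in base R (a simplified stand-in for predict_2d_embeddings(); all names here are illustrative): each test point is matched to its nearest high-D bin coordinate, and that bin's 2D centroid becomes the predicted embedding.

```r
## Nearest-bin prediction: steps 1-2 find the closest high-D bin,
## step 3 records its ID, step 4 returns its 2D centroid.
predict_embedding <- function(test_point, highd_bins, centroids_2d) {
  d <- apply(highd_bins, 1, function(b) sqrt(sum((test_point - b)^2)))
  nearest <- which.min(d)
  c(bin = nearest, centroids_2d[nearest, ])
}

highd_bins   <- rbind(c(0, 0, 0), c(1, 1, 1))      # two bins in 3D
centroids_2d <- data.frame(x = c(-2.7, 1.9), y = c(-4.5, -0.9))
pred <- predict_embedding(c(0.9, 1.1, 0.8), highd_bins, centroids_2d)
pred  # nearest bin is 2, so its centroid (1.9, -0.9) is predicted
```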

The predict_2d_embeddings() function is used to predict 2D embeddings for a new data point. The inputs for the function are test data, bin centroid coordinates in 2D, lifting model coordinates in high-D, and the type of the NLDR technique. The output contains the predicted 2D embeddings along with the predicted hexagonal id and the ID of the test data.

pred_df_test <- predict_2d_embeddings(test_data = training_data,
                                      df_bin_centroids = df_bin_centroids,
                                      df_bin = df_bin, type_NLDR = "UMAP")

glimpse(pred_df_test)
Rows: 75
Columns: 4
$ pred_UMAP_1 <dbl> -2.674222, 1.859459, 0.726039, -1.540802, -1.540…
$ pred_UMAP_2 <dbl> -4.4744481, -0.8636935, 0.9416838, -6.2798254, -…
$ ID          <int> 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, …
$ pred_hb_id  <int> 5, 15, 19, 2, 10, 31, 15, 11, 15, 28, 11, 5, 28,…

Making summaries

Two important summaries should be made when constructing the model on a dataset with a specific NLDR technique: the Mean Squared Error (MSE) and the Akaike Information Criterion (AIC), both of which measure prediction accuracy. The function that performs these summaries is generate_summary(). Its output is a list containing the MSE and the AIC. The code below uses generate_summary() to find the MSE and AIC values for the model constructed on UMAP 2D embeddings generated from the s_curve_noise training data.

generate_summary(test_data = training_data, prediction_df = pred_df_test,
                 df_bin = df_bin, col_start = "x")
$mse
[1] 0.2983091

$aic
[1] -467.0531
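The MSE component can be sketched directly (the exact formulas generate_summary() uses for the MSE and AIC are assumptions here): it is the average squared Euclidean distance between each observation in high-D and the averaged high-D coordinate of its predicted bin.

```r
## Toy example: two observations in two high-D variables and the averaged
## coordinates of their matched bins. Illustrative, not the package's code.
truth  <- rbind(c(0.0, 1.0), c(2.0, 3.0))  # observed high-D values
fitted <- rbind(c(0.5, 1.5), c(2.0, 2.0))  # matched bin averages
mse <- mean(rowSums((truth - fitted)^2))
mse  # 0.75
```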

Visualizations

We use static visualizations to understand the model constructed in 2D, and dynamic visualization to see how the model looks in high-D space.

Static visualizations

Static visualizations mainly involve two types of results: the triangulation and the long-edge removal. Both types of visualizations provide ggplot objects.

Dynamic visualizations

The show_langevitour() function enables dynamic visualization of the 2D model alongside the high-dimensional (high-D) data in its original space. This visualization is facilitated by a langevitour object, allowing users to interactively explore the relationship between the 2D embeddings and the underlying high-D data. The main arguments are shown in Table

trimesh <- ggplot(df_bin_centroids, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_trimesh() +
  coord_equal()

trimesh

trimesh_gr <- colour_long_edges(distance_edges = distance, benchmark_value = benchmark,
                                tr_from_to_df_coord = tr_from_to_df, distance_col = "distance")

trimesh_gr

trimesh_removed <- remove_long_edges(distance_edges = distance, benchmark_value = benchmark,
                                     tr_from_to_df_coord = tr_from_to_df, distance_col = "distance")
trimesh_removed

tour1 <- show_langevitour(df_all, df_bin, df_bin_centroids, 
                          benchmark_value = benchmark,
                          distance = distance, distance_col = "distance", 
                          use_default_benchmark_val = FALSE, 
                          column_start_text = "x")
tour1

Find non-empty bins

2.2 Tests

All functions have tests written and implemented using the testthat package (Wickham 2011) in R.

3 Application

medlea_df <- read_csv("data/medlea_dataset.csv")
names(medlea_df)[2:(NCOL(medlea_df) - 1)] <- paste0("x", 1:(NCOL(medlea_df) - 2))

medlea_df <- medlea_df |>
  select(-x10) ## Removed since x10 contains only zeros

#medlea_df[,2:(NCOL(medlea_df) - 1)] <- scale(medlea_df[,2:(NCOL(medlea_df) - 1)])

calculate_pca <- function(feature_dataset, num_pcs){
  pcaY_cal <- prcomp(feature_dataset, center = TRUE, scale = TRUE)
  PCAresults <- data.frame(pcaY_cal$x[, 1:num_pcs])
  summary_pca <- summary(pcaY_cal)
  var_explained_df <- data.frame(PC= paste0("PC",1:50),
                               var_explained=(pcaY_cal$sdev[1:50])^2/sum((pcaY_cal$sdev[1:50])^2))
  return(list(prcomp_out = pcaY_cal,pca_components = PCAresults, summary = summary_pca, var_explained_pca  = var_explained_df))
}
features <- medlea_df[,2:(NCOL(medlea_df) - 1)]
pca_ref_calc <- calculate_pca(features, 8) 
pca_ref_calc$summary
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6
Standard deviation     3.1691 3.0609 2.7226 1.87967 1.71219 1.34192
Proportion of Variance 0.1969 0.1837 0.1453 0.06928 0.05748 0.03531
Cumulative Proportion  0.1969 0.3806 0.5260 0.59526 0.65274 0.68805
                           PC7     PC8     PC9    PC10    PC11
Standard deviation     1.27525 1.16992 1.13465 1.06628 1.03279
Proportion of Variance 0.03189 0.02684 0.02524 0.02229 0.02091
Cumulative Proportion  0.71993 0.74677 0.77202 0.79431 0.81522
                          PC12    PC13   PC14   PC15   PC16    PC17
Standard deviation     0.97899 0.96264 0.9528 0.9116 0.9090 0.79750
Proportion of Variance 0.01879 0.01817 0.0178 0.0163 0.0162 0.01247
Cumulative Proportion  0.83402 0.85219 0.8700 0.8863 0.9025 0.91496
                          PC18    PC19    PC20    PC21   PC22    PC23
Standard deviation     0.76725 0.72414 0.65310 0.61052 0.6019 0.55399
Proportion of Variance 0.01154 0.01028 0.00836 0.00731 0.0071 0.00602
Cumulative Proportion  0.92650 0.93678 0.94514 0.95245 0.9596 0.96557
                          PC24    PC25    PC26   PC27    PC28    PC29
Standard deviation     0.52293 0.46638 0.41959 0.3976 0.34697 0.33415
Proportion of Variance 0.00536 0.00426 0.00345 0.0031 0.00236 0.00219
Cumulative Proportion  0.97093 0.97520 0.97865 0.9818 0.98411 0.98630
                          PC30    PC31    PC32    PC33    PC34
Standard deviation     0.30618 0.29237 0.28458 0.26033 0.25420
Proportion of Variance 0.00184 0.00168 0.00159 0.00133 0.00127
Cumulative Proportion  0.98814 0.98982 0.99140 0.99273 0.99400
                          PC35    PC36    PC37    PC38   PC39    PC40
Standard deviation     0.22792 0.21644 0.20437 0.19127 0.1744 0.15586
Proportion of Variance 0.00102 0.00092 0.00082 0.00072 0.0006 0.00048
Cumulative Proportion  0.99502 0.99594 0.99676 0.99747 0.9981 0.99855
                          PC41    PC42    PC43    PC44    PC45
Standard deviation     0.15252 0.12519 0.10485 0.08598 0.08008
Proportion of Variance 0.00046 0.00031 0.00022 0.00014 0.00013
Cumulative Proportion  0.99900 0.99931 0.99952 0.99967 0.99980
                          PC46    PC47    PC48    PC49    PC50
Standard deviation     0.06491 0.04841 0.04094 0.03791 0.02347
Proportion of Variance 0.00008 0.00005 0.00003 0.00003 0.00001
Cumulative Proportion  0.99988 0.99992 0.99996 0.99999 1.00000
                          PC51
Standard deviation     0.01421
Proportion of Variance 0.00000
Cumulative Proportion  1.00000
var_explained_df <- pca_ref_calc$var_explained_pca
data_pca <- pca_ref_calc$pca_components |>
  mutate(ID = 1:NROW(pca_ref_calc$pca_components),
         shape_label = medlea_df$Shape_label)

var_explained_df |>
  ggplot(aes(x = PC,y = var_explained, group = 1))+
  geom_point(size=1)+
  geom_line()+
  labs(title="Scree plot: PCA on scaled data") +
  scale_x_discrete(limits = paste0(rep("PC", 50), 1:50)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
data_split <- initial_split(data_pca)
training_data <- training(data_split) |>
  arrange(ID)
test_data <- testing(data_split) |>
  arrange(ID)
UMAP_fit <- umap(training_data |> dplyr::select(-c(ID, shape_label)), n_neighbors = 37, n_components =  2)

UMAP_data <- UMAP_fit$layout |>
  as.data.frame()
names(UMAP_data) <- paste0("UMAP", seq_len(ncol(UMAP_data)))

UMAP_data <- UMAP_data |>
  mutate(ID = training_data$ID)

UMAP_data_with_label <- UMAP_data |>
  mutate(shape_label = training_data$shape_label)
UMAP_data_with_label |>
    ggplot(aes(x = UMAP1,
               y = UMAP2, color = shape_label))+
    geom_point(alpha=0.5) +
    coord_equal() +
    theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold")) + #ggtitle("(a)") +
  theme_linedraw() +
    theme(legend.position = "none", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
              axis.title.x = element_blank(), axis.title.y = element_blank(),
              axis.text.x = element_blank(), axis.ticks.x = element_blank(),
              axis.text.y = element_blank(), axis.ticks.y = element_blank(),
              panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=5), #change legend title font size
        legend.text = element_text(size=4),
         legend.key.height = unit(0.25, 'cm'),
         legend.key.width = unit(0.25, 'cm')) +
  scale_color_manual(values=c("#b15928", "#1f78b4", "#cab2d6", "#ccebc5", "#fb9a99", "#e31a1c", "#6a3d9a", "#ff7f00", "#ffed6f", "#fdbf6f", "#ffff99", "#a6cee3", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#b2df8a", "#bc80bd", "#33a02c", "#ccebc5", "#ffed6f", "#000000", "#bdbdbd"))

tSNE_data <- Fit_tSNE(training_data |> dplyr::select(-c(ID, shape_label)),
                      opt_perplexity = calculate_effective_perplexity(
                        training_data |> dplyr::select(-c(ID, shape_label))),
                      with_seed = 20240110)

tSNE_data <- tSNE_data |>
  select(-ID) |>
  mutate(ID = training_data$ID)

tSNE_data_with_label <- tSNE_data |>
  mutate(shape_label = training_data$shape_label)

tSNE_data_with_label |>
    ggplot(aes(x = tSNE1,
               y = tSNE2, color = shape_label))+
    geom_point(alpha=0.5) +
    coord_equal() +
    theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold")) + #ggtitle("(a)") +
  theme_linedraw() +
    theme(legend.position = "none", plot.title = element_text(size = 7, hjust = 0.5, vjust = -0.5),
              axis.title.x = element_blank(), axis.title.y = element_blank(),
              axis.text.x = element_blank(), axis.ticks.x = element_blank(),
              axis.text.y = element_blank(), axis.ticks.y = element_blank(),
              panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #change legend key width
        legend.title = element_text(size=5), #change legend title font size
        legend.text = element_text(size=4),
         legend.key.height = unit(0.25, 'cm'),
         legend.key.width = unit(0.25, 'cm')) +
  scale_color_manual(values=c("#b15928", "#1f78b4", "#cab2d6", "#ccebc5", "#fb9a99", "#e31a1c", "#6a3d9a", "#ff7f00", "#ffed6f", "#fdbf6f", "#ffff99", "#a6cee3", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#b2df8a", "#bc80bd", "#33a02c", "#ccebc5", "#ffed6f", "#000000", "#bdbdbd"))

PHATE_data <- Fit_PHATE(training_data |> dplyr::select(-c(ID, shape_label)), knn = 5, with_seed = 20240110)
Calculating PHATE...
  Running PHATE on 824 observations and 8 variables.
  Calculating graph and diffusion operator...
    Calculating KNN search...
    Calculated KNN search in 0.01 seconds.
    Calculating affinities...
  Calculated graph and diffusion operator in 0.02 seconds.
  Calculating optimal t...
    Automatically selected t = 24
  Calculated optimal t in 3.16 seconds.
  Calculating diffusion potential...
  Calculated diffusion potential in 0.32 seconds.
  Calculating metric MDS...
  Calculated metric MDS in 11.28 seconds.
Calculated PHATE in 14.79 seconds.
PHATE_data <- PHATE_data |>
  select(PHATE1, PHATE2) |>
  mutate(ID = training_data$ID)

PHATE_data_with_label <- PHATE_data |>
  mutate(shape_label = training_data$shape_label)

PHATE_data_with_label |>
  ggplot(aes(x = PHATE1, y = PHATE2, color = shape_label)) +
  geom_point(alpha = 0.5) +
  coord_equal() +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  scale_color_manual(values = c("#b15928", "#1f78b4", "#cab2d6", "#ccebc5", "#fb9a99", "#e31a1c", "#6a3d9a", "#ff7f00", "#ffed6f", "#fdbf6f", "#ffff99", "#a6cee3", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#b2df8a", "#bc80bd", "#33a02c", "#ccebc5", "#ffed6f", "#000000", "#bdbdbd"))

tem_dir <- tempdir()

Fit_TriMAP_data(training_data |> dplyr::select(-c(ID, shape_label)), tem_dir)

path <- file.path(tem_dir, "df_2_without_class.csv")
path2 <- file.path(tem_dir, "dataset_3_TriMAP_values.csv")

Fit_TriMAP(as.integer(2), as.integer(5), as.integer(4), as.integer(3), path, path2)

TriMAP_data <- read_csv(path2)
TriMAP_data <- TriMAP_data |>
  mutate(ID = training_data$ID)

TriMAP_data_with_label <- TriMAP_data |>
  mutate(shape_label = training_data$shape_label)

TriMAP_data_with_label |>
  ggplot(aes(x = TriMAP1, y = TriMAP2, color = shape_label)) +
  geom_point(alpha = 0.5) +
  coord_equal() +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  scale_color_manual(values = c("#b15928", "#1f78b4", "#cab2d6", "#ccebc5", "#fb9a99", "#e31a1c", "#6a3d9a", "#ff7f00", "#ffed6f", "#fdbf6f", "#ffff99", "#a6cee3", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#b2df8a", "#bc80bd", "#33a02c", "#ccebc5", "#ffed6f", "#000000", "#bdbdbd"))

tem_dir <- tempdir()

Fit_PacMAP_data(training_data |> dplyr::select(-c(ID, shape_label)), tem_dir)

path <- file.path(tem_dir, "df_2_without_class.csv")
path2 <- file.path(tem_dir, "dataset_3_PaCMAP_values.csv")

Fit_PaCMAP(as.integer(2), as.integer(10), "random", 0.9, as.integer(2), path, path2)

PaCMAP_data <- read_csv(path2)
PaCMAP_data <- PaCMAP_data |>
  mutate(ID = training_data$ID)

PaCMAP_data_with_label <- PaCMAP_data |>
  mutate(shape_label = training_data$shape_label)

PaCMAP_data_with_label |>
  ggplot(aes(x = PaCMAP1, y = PaCMAP2, color = shape_label)) +
  geom_point(alpha = 0.5) +
  coord_equal() +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  scale_color_manual(values = c("#b15928", "#1f78b4", "#cab2d6", "#ccebc5", "#fb9a99", "#e31a1c", "#6a3d9a", "#ff7f00", "#ffed6f", "#fdbf6f", "#ffff99", "#a6cee3", "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462", "#b3de69", "#fccde5", "#d9d9d9", "#b2df8a", "#bc80bd", "#33a02c", "#ccebc5", "#ffed6f", "#000000", "#bdbdbd"))

num_bins_x <- calculate_effective_x_bins(.data = tSNE_data, x = "tSNE1", hex_size = NA)
num_bins_x
[1] 33
num_bins_y <- calculate_effective_y_bins(.data = tSNE_data, y = "tSNE2", hex_size = NA)
num_bins_y
[1] 34
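
The bin counts follow from the spread of the embedding and the chosen hexagon size. As a rough sanity check (an illustrative assumption, not the package's exact internal formula, which may also include a buffer), the counts should be close to the data range divided by the default hex size of 1.07457:

```r
# Rough check of bin counts (assumption: bin width ~ default hex size
# 1.07457; quollr may add a buffer, so exact counts can differ slightly).
hex_size <- 1.07457
approx_bins_x <- ceiling(diff(range(tSNE_data$tSNE1)) / hex_size)
approx_bins_y <- ceiling(diff(range(tSNE_data$tSNE2)) / hex_size)
approx_bins_x
approx_bins_y
```

If these approximations differ markedly from the reported values, the hex size is probably a poor fit for the data range and should be adjusted.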
all_centroids_df <- generate_full_grid_centroids(nldr_df = tSNE_data, 
                                                 x = "tSNE1", y = "tSNE2", 
                                                 num_bins_x = num_bins_x, 
                                                 num_bins_y = num_bins_y, 
                                                 buffer_size = NA, hex_size = NA)


hex_grid <- gen_hex_coordinates(all_centroids_df)

full_grid_with_hexbin_id <- map_hexbin_id(all_centroids_df)

full_grid_with_polygon_id <- map_polygon_id(full_grid_with_hexbin_id, hex_grid)

tSNE_data_with_id <- assign_data(tSNE_data, full_grid_with_hexbin_id)

df_with_std_counts <- compute_std_counts(nldr_df = tSNE_data_with_id)

hex_full_count_df <- generate_full_grid_info(full_grid_with_polygon_id, df_with_std_counts, hex_grid)

ggplot(data = hex_full_count_df, aes(x = x, y = y)) +
  geom_polygon(color = "black", aes(group = polygon_id, fill = std_counts)) +
  geom_text(aes(x = c_x, y = c_y, label = hexID), size = 2) +
  scale_fill_viridis_c(direction = -1, na.value = "#ffffff")

ggplot(data = hex_grid, aes(x = x, y = y)) +
  geom_polygon(fill = "white", color = "black", aes(group = id)) +
  geom_point(data = tSNE_data, aes(x = tSNE1, y = tSNE2), color = "blue")

df_bin_centroids <- extract_hexbin_centroids(hex_full_count_df)
tr1_object <- triangulate_bin_centroids(df_bin_centroids, x = "x", y = "y")
tr_from_to_df <- generate_edge_info(triangular_object = tr1_object)
## To generate a data set with high-D and 2D training data
df_all <- training_data |> dplyr::select(-c(ID, shape_label)) |>
  dplyr::bind_cols(tSNE_data_with_id)

## To generate averaged high-D data

df_bin <- avg_highD_data(.data = df_all, column_start_text = "PC")
## Compute 2D distances
distance <- cal_2d_dist(tr_from_to_df_coord = tr_from_to_df)
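
The 2D distances here are the Euclidean lengths of the triangulation edges. Assuming, for illustration, that the edge data holds endpoint coordinates in columns named `x_from`, `y_from`, `x_to`, `y_to` (hypothetical names; check the actual output of `generate_edge_info()`), the computation amounts to:

```r
# Edge length as Euclidean distance between endpoints.
# Column names x_from, y_from, x_to, y_to are assumed for illustration.
edge_length <- function(df) {
  sqrt((df$x_to - df$x_from)^2 + (df$y_to - df$y_from)^2)
}
```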

plot_dist(distance)
benchmark <- find_benchmark_value(distance_edges = distance, distance_col = "distance")
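
The benchmark value is a cutoff above which edges are treated as spuriously long. As a sketch of one plausible rule (an assumption for illustration, not necessarily quollr's documented algorithm), a boxplot-style upper fence on the edge-length distribution could serve as such a cutoff:

```r
# Hypothetical benchmark rule: boxplot upper fence on the 2D edge lengths.
# Illustrative assumption only; find_benchmark_value() may use another rule.
q3 <- quantile(distance$distance, 0.75, na.rm = TRUE)
upper_fence <- q3 + 1.5 * IQR(distance$distance, na.rm = TRUE)
upper_fence
```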
trimesh <- ggplot(df_bin_centroids, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_trimesh() +
  coord_equal()

trimesh

trimesh_gr <- colour_long_edges(distance_edges = distance, benchmark_value = benchmark,
                                tr_from_to_df_coord = tr_from_to_df, distance_col = "distance")

trimesh_gr

trimesh_removed <- remove_long_edges(distance_edges = distance, benchmark_value = benchmark,
                                     tr_from_to_df_coord = tr_from_to_df, distance_col = "distance")
trimesh_removed

tour1 <- show_langevitour(df_all, df_bin, df_bin_centroids, benchmark_value = benchmark,
                          distance = distance, distance_col = "distance", column_start_text = "PC")
tour1

4 Conclusion

5 Acknowledgements

This article was created using knitr (Xie 2015) and rmarkdown (Xie et al. 2018) in R with the rjtools::rjournal_article template. The source code for reproducing this paper is available at: https://github.com/JayaniLakshika/paper-quollr.

5.1 CRAN packages used

testthat, knitr, rmarkdown

5.2 CRAN Task Views implied by cited packages

ReproducibleResearch

H. Wickham. Testthat: Get started with testing. The R Journal, 3: 5–10, 2011. URL https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
Y. Xie. Dynamic documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC, 2015. URL https://yihui.name/knitr/. ISBN 978-1498716963.
Y. Xie, J. J. Allaire and G. Grolemund. R markdown: The definitive guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. URL https://bookdown.org/yihui/rmarkdown. ISBN 978-1138359338.

References

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lakshika, et al., "quollr: An R Package for Visualizing 2D Models in High Dimensional Space", The R Journal, 2024

BibTeX citation

@article{paper-quollr,
  author = {Lakshika, Jayani P.G. and Cook, Dianne and Harrison, Paul and Lydeamore, Michael and Talagala, Thiyanga S.},
  title = {quollr: An R Package for Visualizing 2D Models in High Dimensional Space},
  journal = {The R Journal},
  year = {2024},
  issn = {2073-4859},
  pages = {1}
}